The Floating Arabic Dictionary: An Automatic Method for Updating a Lexical Database through the Detection and Lemmatization of Unknown Words

نویسندگان

  • Mohammed Attia
  • Younes Samih
  • Khaled F. Shaalan
  • Josef van Genabith
چکیده

Unknown words, or out of vocabulary words (OOV), cause a significant problem to morphological analysers, syntactic parses, MT systems and other NLP applications. Unknown words make up 29 % of the word types in in a large Arabic corpus used in this study. With today's corpus sizes exceeding 10 9 words, it becomes impossible to manually check corpora for new words to be included in a lexicon. We develop a finite-state morphological guesser and integrate it with a machine-learning-based pre-annotation tool in a pipeline architecture for extracting unknown words, lemmatizing them, and giving them a priority weight for inclusion in a lexical database. The processing is performed on a corpus of contemporary Arabic of 1,089,111,204 words. Our method is tested on a manually-annotated gold standard and yields encouraging results despite the complexity of the task. Our work shows the usability of a highly non-deterministic morphological guesser in a practical and complex application.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Automatic Lemmatizer Construction with Focus on OOV Words Lemmatization

This paper deals with the automatic construction of a lemmatizer from a Full Form Lemma (FFL) training dictionary and with lemmatization of new, in the FFL dictionary unseen, i.e. out-ofvocabulary (OOV) words. Three methods of lemmatization of three kinds of OOV words (missing full forms, unknown words, and compound words) are introduced. These methods were tested on Czech test data. The best r...

متن کامل

Automatic Construction of Persian ICT WordNet using Princeton WordNet

WordNet is a large lexical database of English language, in which, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets). Each synset expresses a distinct concept. Synsets are interlinked by both semantic and lexical relations. WordNet is essentially used for word sense disambiguation, information retrieval, and text translation. In this paper, we propose s...

متن کامل

Handling Unknown Words in Arabic FST Morphology

A morphological analyser only recognizes words that it already knows in the lexical database. It needs, however, a way of sensing significant changes in the language in the form of newly borrowed or coined words with high frequency. We develop a finite-state morphological guesser in a pipelined methodology for extracting unknown words, lemmatizing them, and giving them a priority weight for inc...

متن کامل

A New Similarity Measure for Automatic Construction of the Unknown Word Lexical Dictionary

This paper deals with research that automatically constructs a lexical dictionary of unknown words as an automatic lexical dictionary expansion. The lexical dictionary has been usefully applied to various fields for semantic information processing. It has limitations in which it only processes terms defined in the dictionary. Under this circumstance, the concept of “Unknown Word (UW)” is define...

متن کامل

Iranian EFL Learners’ Lexical Inferencing Strategies at Both Text and Sentence levels

Lexical inferencing is one of the most important strategies in vocabulary learning and it plays an important role in dealing with unknown words in a text. In this regard, the aim of this study was to determine the lexical inferencing strategies used by Iranian EFL learners when they encounter unknown words at both text and sentence levels. To this end, forty lower intermediate students were div...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012